Ecobici is a government-sponsored program to encourage people to use bicycles as a means of transportation in Mexico City. Users of the program pay an annual fee to borrow a bike at any time for a maximum duration of 45 minutes. There are bike stations in different parts of the city, and one can take a bike at any station and return it at any other.
While the program has enjoyed a good general reception since it was launched in 2010, there is still a lot that can be done to further increase its adoption. To create a successful expansion plan, however, one must first gather knowledge about the factors that seem to influence adoption. One way to do that is by understanding the current state of adoption faceted by different traits, such as gender, age, location, etc.
For this project, we’ll attempt to answer the following questions:
Surely other questions will pop up as we do the analysis, and we’ll try to answer those as well.
The dataset we’ll be analyzing in this project is that of users of the Ecobici program who signed up between February 15, 2010 and December 31, 2013. This and other datasets related to Ecobici can be downloaded from Mexico City’s Data Labs official website.
The columns of the dataset include:
(Later on when we prepare the dataset, you’ll notice that, in the registration.type variable, there’s one value called “ALTA TELMEX”. Telmex is the largest phone company in Mexico, and has partnered with Mexico City’s government to allow payment of the annual fee for Ecobici through a user’s phone bill.)
We’ll also explore some variables that are not included in the previous list, but which can be easily derived from the original dataset:
Before we begin, we’ll go through a couple of steps to put the dataset in a form more suitable for analysis.
We’ll read in the data from the ecobici_usuarios.csv file, specifying that we don’t want to read strings as factors. This is because we don’t want dates to be turned into factors. We also want to strip whitespace so as to avoid additional spurious factor levels later on.
## [1] "Spanish_Spain.1252"
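As a rough illustration of the two read options just described, here is a minimal sketch. The inline CSV text is a made-up two-row excerpt standing in for ecobici_usuarios.csv (it is an assumption, not the real file), and the name raw_users is hypothetical:

```r
# Hypothetical two-row excerpt standing in for ecobici_usuarios.csv.
# Note the stray spaces around some fields.
csv_text <- "USUARIO,SEXO,FECHA.DE.NACIMIENTO,MEDIO.DE.INSCRIPCION
145, M ,1964-11-19,ALTA WEB
669,M,1978-08-14, ALTA WEB "

raw_users <- read.csv(text = csv_text,
                      stringsAsFactors = FALSE,  # keep strings (and dates) as character
                      strip.white = TRUE)        # trim padding around unquoted fields

str(raw_users)
```

Without strip.white, " M " and "M" would later become two different factor levels.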
Our dataset contains 111935 entries of 11 variables, and by running the str command, we can see that all columns are named in Spanish:
## 'data.frame': 111935 obs. of 11 variables:
## $ USUARIO : int 145 669 856 865 26538 956 28702 901 990 980 ...
## $ TARJETA : chr "2938835630" "2938833614" "2861480206" "2935342734" ...
## $ SEXO : chr "M" "M" "M" "M" ...
## $ FECHA.DE.NACIMIENTO : chr "1964-11-19" "1978-08-14" "1979-04-01" "1978-09-07" ...
## $ COLONIA : chr "El Prado" "Zona Escolar" "Condesa" "Cuauhtémoc" ...
## $ DELEGACION : chr "Coyoacán" "Gustavo A. Madero" "Cuauhtémoc" "Cuauhtémoc" ...
## $ ESTADO : chr "D.F." "D.F." "D.F." "D.F." ...
## $ MEDIO.DE.INSCRIPCION: chr "ALTA WEB" "ALTA WEB" "ALTA" "ALTA WEB" ...
## $ FECHA.DE.INSCRIPCION: chr "2010-02-16" "2010-02-18" "2010-02-19" "2010-02-19" ...
## $ USOS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ STATUS : chr "Vigente" "Vigente" "Vigente" "Vigente" ...
So we’ll translate them to English to make it easier for everyone to understand what each variable stands for:
## [1] "user.id" "card.id" "gender"
## [4] "birthday" "borough" "municipality"
## [7] "state" "registration.type" "registration.date"
## [10] "rides.count" "status"
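One way the renaming could be done is with a named lookup vector, so that the result doesn't depend on column order. This is only a sketch: the Spanish and English names come from the outputs above, while the one-row placeholder data frame and the name lookup are assumptions for illustration.

```r
spanish <- c("USUARIO", "TARJETA", "SEXO", "FECHA.DE.NACIMIENTO", "COLONIA",
             "DELEGACION", "ESTADO", "MEDIO.DE.INSCRIPCION",
             "FECHA.DE.INSCRIPCION", "USOS", "STATUS")
english <- c("user.id", "card.id", "gender", "birthday", "borough",
             "municipality", "state", "registration.type",
             "registration.date", "rides.count", "status")
lookup <- setNames(english, spanish)

# Placeholder one-row data frame with the original column names
users <- data.frame(matrix(NA, nrow = 1, ncol = length(spanish)))
names(users) <- spanish

# Translate each name through the lookup
names(users) <- unname(lookup[names(users)])
names(users)
```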
Some of our categorical variables (registration.type, status) also have values in Spanish, or values that are not totally clear (gender). We’ll translate those as well.
## [1] "ALTA WEB" "ALTA WEB" "ALTA" "ALTA WEB" "ALTA" "ALTA"
## [1] "Vigente" "Vigente" "Vigente" "Vigente" "Inactivo" "Vigente"
## [1] "M" "M" "M" "M" "M" "F"
After a little bit of cleaning and parsing (see the accompanying Rmd file for details), we get the following structure:
## 'data.frame': 110922 obs. of 11 variables:
## $ user.id : int 145 669 856 865 956 28702 901 990 980 28269 ...
## $ card.id : Factor w/ 110050 levels "029170de","03909c2c",..: 78910 78894 57399 73629 69873 65478 71733 71608 79109 53457 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 2 2 1 ...
## $ birthday : POSIXct, format: "1964-11-19" "1978-08-14" ...
## $ borough : Factor w/ 5502 levels "1 ampl presidentes",..: 1425 5497 1063 1158 4807 334 3522 4072 2083 4351 ...
## $ municipality : Factor w/ 222 levels "acambay","acolman",..: 44 67 48 48 108 108 22 48 48 22 ...
## $ state : Factor w/ 30 levels "aguascalientes",..: 7 7 7 7 7 7 7 7 7 9 ...
## $ registration.type: Factor w/ 3 levels "normal","telmex",..: 3 3 1 3 1 1 1 3 1 1 ...
## $ registration.date: POSIXct, format: "2010-02-16" "2010-02-18" ...
## $ rides.count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ status : Factor w/ 2 levels "inactive","active": 2 2 2 2 2 1 1 2 2 1 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:1013] 5 15 910 1419 2395 2437 2600 2652 2922 3036 ...
## .. ..- attr(*, "names")= chr [1:1013] "5" "15" "910" "1419" ...
And the values of our categorical variables are now clearer:
## registration.type status gender
## 1 web active male
## 2 web active male
## 3 normal active male
## 4 web active male
## 6 normal active female
## 7 normal inactive male
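The actual cleaning code lives in the accompanying Rmd file, but the translation of the categorical values could look roughly like the sketch below. The mapping is inferred from the before and after outputs shown above (for example "ALTA" becoming "normal"); the toy data frame raw is an assumption.

```r
# Toy rows using the original Spanish codes
raw <- data.frame(
  registration.type = c("ALTA WEB", "ALTA", "ALTA TELMEX"),
  status            = c("Vigente", "Inactivo", "Vigente"),
  gender            = c("M", "F", "M"),
  stringsAsFactors  = FALSE
)

# factor() maps each original code (levels) to its English label
recoded <- data.frame(
  registration.type = factor(raw$registration.type,
                             levels = c("ALTA", "ALTA TELMEX", "ALTA WEB"),
                             labels = c("normal", "telmex", "web")),
  status = factor(raw$status,
                  levels = c("Inactivo", "Vigente"),
                  labels = c("inactive", "active")),
  gender = factor(raw$gender,
                  levels = c("F", "M"),
                  labels = c("female", "male"))
)
recoded
```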
Now we’ll extract some extra variables. We’ll begin with tenure (the number of days since the user registered):
## 'data.frame': 110922 obs. of 1 variable:
## $ tenure: num 1414 1412 1411 1411 1411 ...
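For reference, a minimal sketch of the tenure computation. It assumes the reference date is December 31, 2013 (the last sign-up date in the dataset), which is consistent with the values printed above: a user who registered on 2010-02-16 gets a tenure of 1414 days.

```r
reference_date <- as.Date("2013-12-31")  # assumed reference date
registration.date <- as.Date(c("2010-02-16", "2010-02-18", "2010-02-19"))

# Subtracting two Date vectors gives a difference in days
tenure <- as.numeric(reference_date - registration.date)
tenure
```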
Then we’ll add age:
## 'data.frame': 110922 obs. of 1 variable:
## $ age: num 49 35 34 35 39 3 3 40 62 3 ...
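Age can be derived in a similar way. The helper below is a hypothetical sketch (age_in_years is not a name from the original analysis) that counts completed years at the same assumed reference date of December 31, 2013; it reproduces the first few ages printed above.

```r
# Completed years between a birthday and a reference date
age_in_years <- function(birthday, reference) {
  b <- as.POSIXlt(birthday)
  r <- as.POSIXlt(reference)
  age <- r$year - b$year
  # Subtract one year if the birthday hasn't happened yet in the reference year
  not_yet <- (r$mon < b$mon) | (r$mon == b$mon & r$mday < b$mday)
  age - as.integer(not_yet)
}

birthday <- as.Date(c("1964-11-19", "1978-08-14", "1979-04-01"))
age <- age_in_years(birthday, as.Date("2013-12-31"))
age
```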
As is usual with datasets in the real world, there’s some bad data that we need to filter out. For our dataset, it turns out that there are 26 people who were born in the future or were riding an adult bike at age 3 or less!
## [1] 26 13
So, let’s remove those and get ready for our exploratory data analysis. Let’s get started!
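A sketch of that filtering step, using a toy data frame: people born in the future end up with a negative age, so a single condition on age covers both kinds of bad rows. The cutoff of 3 follows the description above; the name implausible is hypothetical.

```r
users <- data.frame(user.id = 1:4, age = c(49, 3, -1, 13))

# Flag rows with an impossible or implausible age
implausible <- users$age <= 3
sum(implausible)              # how many rows we are about to drop

users <- users[!implausible, ]
users
```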
To get a bird’s-eye overview of our dataset, let’s create a scatterplot matrix:
Some of the resulting plots are very easy to interpret (e.g. histograms of categorical variables), while others, such as the histogram and density plots of rides.count, require further investigation to understand completely. In the following sections, we’ll try to understand such plots better.
Let’s begin by taking a look at a summary of the variables of interest, one by one, in order to avoid the crammed output that doing summary(users) would give us:
Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 28.00 33.00 35.29 41.00 83.00
Tenure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 191.0 386.0 544.9 1015.0 1415.0
Number of rides
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 6.00 34.92 38.00 1254.00
Gender
## female male
## 42311 68585
Status
## inactive active
## 13216 97680
State, Municipality and Borough
## state municipality borough
## d.f. :95822 cuauhtémoc :35743 roma norte : 6124
## edo. mex.:14843 miguel hidalgo :19675 condesa : 5203
## hidalgo : 28 benito juárez : 8857 cuauhtémoc : 4617
## morelos : 26 gustavo a. madero: 5800 juárez : 2731
## querétaro: 21 álvaro obregón : 4916 hipódromo : 2557
## jalisco : 20 coyoacán : 3930 polanco v sección: 2444
## (Other) : 136 (Other) :31975 (Other) :87220
Type of Registration
## normal telmex web
## 105158 977 4761
Birthday and Date of Registration
## birthday registration.date
## Min. :1930-05-30 00:00:00 Min. :2010-02-15 00:00:00
## 1st Qu.:1972-08-17 00:00:00 1st Qu.:2011-03-22 00:00:00
## Median :1980-08-12 00:00:00 Median :2012-12-10 00:00:00
## Mean :1978-03-18 01:14:23 Mean :2012-07-04 01:26:28
## 3rd Qu.:1985-09-09 00:00:00 3rd Qu.:2013-06-23 00:00:00
## Max. :2000-04-05 00:00:00 Max. :2013-12-31 00:00:00
Some facts we can see right away include:
The number of registered female users (42311) versus male users (68585)
The oldest birthday (May 30 1930)
The states with most users: D.F. (Federal District) with 95822 users, followed by Edo. Mex. (State of Mexico) with 14843 users
The municipalities with most users: Cuauhtémoc with 35743 users and Miguel Hidalgo with 19675 users (which comes as no surprise, given that most bike stations are in those areas.)
The range of ages goes from 13 to 83 years old
The number of active users (97680) vs the number of inactive users (13216)
We’ll explore these facts and others in much more detail in the following sections.
One of the original questions posed at the beginning of this project was: Does age play a role in the adoption of Ecobici? The following histogram may give us some insight into the answer:
A couple of things are immediately obvious:
The skewness of the distribution is most likely due to the fact that Ecobici requires a credit card or phone bill invoice for registration. Underage people can still register, but they need to get written approval from their parents.
Let’s add some more detail to the graph to investigate the mean and median, as well as the interquartile range (IQR):
The median seems to be around 33, with the mean at around 35, a consequence of the longer right tail of the distribution. The mode, which represents the largest age group in the program, is right at 28 years old.
Let’s now compare the female and male populations in a frequency polygon plot:
It’s clear that the distributions are somewhat similar in shape but men dominate in number. To see this more clearly, let’s draw them in different plots in the same column, along with their mean, median and IQR:
Now it’s easier to see that the male population tends to be a bit older than the female population. Also, it seems like the variance of the male population is greater. Let’s compute the standard deviation to confirm or deny this perception:
## users$gender: female
## [1] 9.747954
## --------------------------------------------------------
## users$gender: male
## [1] 10.53073
And, indeed, the male population has a greater variance.
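The grouped output above has the format produced by R's by function. A minimal sketch of that kind of call, on made-up ages rather than the real data:

```r
users <- data.frame(
  gender = factor(c("female", "female", "male", "male", "male")),
  age    = c(25, 35, 20, 40, 60)
)

# Apply sd() to the ages of each gender separately
age_sd <- by(users$age, users$gender, sd)
age_sd
```

Swapping sd for summary gives the per-group five-number summaries used elsewhere in this analysis.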
Finally, let’s create a boxplot that includes the very first outliers to get another look at the same data:
One additional piece of information that the boxplot gives us is the age at which men or women start being considered “rare” (the “outlier thresholds”) in their respective populations. Notice that in the case of men, we see the first outliers after age 63, while in the case of women it is after age 57. This supports our hypothesis that the male population tends to be older.
Let’s now explore the distribution partitioned by type of registration:
It’s obvious from this plot that an overwhelming majority of users sign up using the traditional means (as opposed to signing up on the web or using Telmex), but also that the “web” population dominates over the “telmex” population, albeit by a small margin (at least as compared to how the “normal” population dominates over the other two).
Let’s now compare the three populations on a free scale with a more elaborate histogram:
Clearly, the Telmex population seems to be much younger, though it’s unclear why this might be. One possible explanation is that younger people are less likely to have a credit card, so they may be using their phone bill (or their parents’) to join the program.
Running a summary by type of registration, we confirm that the Telmex population does tend to be younger:
## users$registration.type: normal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 28.00 33.00 35.27 41.00 83.00
## --------------------------------------------------------
## users$registration.type: telmex
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 24.00 29.00 32.35 39.00 75.00
## --------------------------------------------------------
## users$registration.type: web
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 29.00 34.00 36.43 42.00 79.00
It also seems that the Telmex population has greater variance. Computing the standard deviation we can see whether this is true:
## users$registration.type: normal
## [1] 10.28671
## --------------------------------------------------------
## users$registration.type: telmex
## [1] 11.24539
## --------------------------------------------------------
## users$registration.type: web
## [1] 9.996479
Finally, let’s make a plot where we can see the distribution of age by gender and type of registration to see whether women or men have a preference in the way they sign up to the program:
There appears to be no significant difference in the distributions of men and women in either the normal or web populations, but it could be argued that there’s something going on in the case of Telmex where the variations in the distributions of men and women are less in sync than in the other populations. This is an intriguing finding, but we have no further data to make a hypothesis about it.
Let’s now proceed to analyze the number of rides for users:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 6.00 34.92 38.00 1254.00
We seem to have a heavily skewed distribution, with a very thin and long tail, from which we can deduce that there’s a small number of users each with a very large number of rides. We can also see this if we tally the frequency of the number of rides and look at the largest values:
##
## 763 767 770 772 781 789 801 817 822 825 838 872 878 888 909
## 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## 915 944 946 1077 1254
## 1 1 1 1 1
To get a better visualization, we’ll apply a scale transformation on the x axis:
There’s an important and very evident finding here: a rather large number of people signed up for the program, used it a couple of times and then stopped using it. Let’s see the exact numbers:
##
## 0 1 2 3 4 5 6 7 8 9
## 40280 3857 3403 2809 2444 2195 1866 1754 1587 1502
There is a very large number of people with no rides. Could this be bad data or is it true that about 36% of the users (40280 out of 110896) signed up for the program but then never actually used it at all?
Let’s add more information to the previous plot:
The median number of rides appears to be 6. Let’s confirm the numbers analytically:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 6.00 34.92 38.00 1254.00
This has really caught our attention. Let’s see what proportion of the total number of users have taken no more than the median number of rides:
## [1] 0.5126785
This number is quite surprising: around 51% of users have taken six rides or fewer, so they have just tried the program a handful of times! Certainly not what we might call “active users”. This is an important finding.
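The printed proportion matches the share of users with at most six rides: adding up the tally above for 0 through 6 rides gives 56854 of 110896 users, or about 0.5127. Because so many users share the same ride count, whether the comparison includes the median itself matters. A small sketch on made-up data:

```r
rides <- c(0, 0, 0, 1, 6, 6, 10, 40, 300)  # toy data with ties at the median
med <- median(rides)

mean(rides <= med)  # share of users at or below the median
mean(rides < med)   # share strictly below the median
```

Taking the mean of a logical vector is a compact way to get a proportion, since TRUE counts as 1 and FALSE as 0.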
Now, let’s analyze whether this phenomenon appears regardless of gender:
In the case of women, the phenomenon seems to be much worse with a median of 3 rides, which we can confirm analytically:
## users$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 3.00 25.38 26.00 944.00
## --------------------------------------------------------
## users$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 8.00 40.81 48.00 1254.00
## users$gender: female
## [1] 26
## --------------------------------------------------------
## users$gender: male
## [1] 48
These last numbers show something surprising: women have a median of 3 rides, while men have a median of 8 rides. The means are higher because of outliers in both groups that drag them up to about 40.8 in the case of men, and about 25.4 in the case of women.
Let’s see the proportion of women (relative only to the female population) who have taken a number of rides equal to or less than the median:
## [1] 0.5013117
Despite the lower median, the proportion of “active users” (informally defined here as those who have taken more than the median number of trips in their population) for women is actually quite close to that of the general population.
However, if we just concentrate on the values below the median, and we look at the proportions, rather than the full counts, it’s immediately obvious that the phenomenon is a bit more pronounced in the case of women:
From this last plot, there’s no doubt that the most devoted “active” users of Ecobici are men (people with over 1,000 rides). Also interesting is that, despite the downward trend after about 25 rides for women, they make an important comeback when getting near 1,000 rides, with a proportion of around 0.30.
Finally, let’s see what additional information about outliers we can glean from a boxplot:
Well, it seems that women with over 150 rides are quite rare as far as their population goes, while men need to surpass 300 rides to be considered rare.
We’ll now take a look at tenure, the number of days a user has been in the program:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 191.0 386.0 544.9 1015.0 1415.0
The distribution shown by the previous plot tells us at least three things:
Most people using Ecobici joined relatively recently (within the last year and a half or so).
The second biggest group of users seems to correspond to the early adopters back in 2010 when Ecobici was first launched.
Something happened after the second year of operation that caused a diminished number of new users. Maybe there was just not enough advertising or only a limited number of bike stations was available, thus limiting the number of potential users. However, that clearly changed during the third year of its operation.
Let’s change the binwidth of the histogram to 1 to see if we notice anything unusual:
The overall shape of the distribution seems the same, but we do see two or three very distinct points with well over 400 counts. We’re not sure what these might represent, but they don’t seem to warrant much attention.
Another comparison we can make is whether the Federal District and the State of Mexico have the same tenure distributions. Let’s find out:
The overall shapes of the distributions are quite similar, but the State of Mexico does have a lower median and mean, and generally fewer people with tenures between 300 and 500 days. This means that, for some reason, after the program’s first year of operation the number of new Ecobici users coming from the State of Mexico declined much more rapidly than the number coming from the Federal District.
Finally, let’s see whether faceting by gender shows anything interesting:
The distributions are really similar in shape, so the answer appears to be no.
Our dataset contains three variables related to location. We’re interested in seeing which locations have the most users. For the case of states there’s really no point in plotting in detail, as the Federal District and the State of Mexico dwarf all other states, as can be seen in the following plots:
In the case of municipalities, we’ll focus on those in the Federal District:
For someone who lives in the Federal District, it’s obvious that there are more municipalities in this graph than there should be. The Federal District is divided into 16 regions (known in Spanish as Delegaciones), and this graph shows way more regions. Taking a cursory look at the regions displayed, it appears that some of them are part of the State of Mexico. Unfortunately, there are many of them obscuring those that we care about.
So it seems like we’ve hit a dead end here, as a proper analysis of users by state and municipality requires the data to be correct. One thing we could do is to ensure consistency between municipalities and states, but doing so requires an official listing of municipalities per state (a “gold standard”), as well as applying some fuzzy matching techniques to ensure all municipalities have a single unique name (as it stands right now, it’s quite possible that the same municipality was input using slightly different names, due to spelling errors, data corruption, lack of accents, etc.).
Nevertheless, it’s likely that, even if states and municipalities are not matched up correctly in many cases, the rest of the data is fine (i.e. people did write their municipality and borough of residence correctly), so we can still plot a histogram including the municipalities and boroughs with most registered users.
We’ll do just that by sorting the regions and making them an ordered factor before we plot (you can see the details in the Common Functions section above):
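As a rough idea of what such a helper might do, the sketch below orders the levels of a factor by how often each value occurs, so that a plot lists the most popular places first. The toy borough vector and variable names are assumptions, not the original helper.

```r
borough <- c("condesa", "roma norte", "polanco",
             "roma norte", "condesa", "roma norte")

# Count users per borough, most frequent first
counts <- sort(table(borough), decreasing = TRUE)

# Rebuild the factor with its levels in that order
borough_ordered <- factor(borough, levels = names(counts), ordered = TRUE)
levels(borough_ordered)
```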
Clearly, the distribution is really far from uniform, with just two or three municipalities leading the charts.
Let’s do the same plot for boroughs now:
We can also see if there are places, amongst those with most users, where the proportion between men and women deviates significantly from that of the general population:
In general, men seem to be the majority, but never exceeding 65%. Let’s see whether we can find a place where the proportion exceeds that by including more boroughs in our plot:
Here we notice for the first time a significant variation: the borough “Centro (Area 1)” (part of Downtown Mexico City) does have a proportion of men close to 69%, the biggest thus far. This is somewhat surprising, as intuitively one would expect that region to have a more balanced proportion.
Having become acquainted with the individual variables of the dataset, we’ll now proceed to look at the relationships between continuous variables in it.
For some of the plots involving the whole dataset, as opposed to summaries, we’ll be using a sample of 30,000 rows to make our plotting faster.
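Drawing such a sample takes one line in base R. In this sketch the seed value and the placeholder data frame are arbitrary assumptions; fixing a seed simply makes the sample reproducible between runs.

```r
set.seed(2013)  # arbitrary seed, for reproducibility only

users <- data.frame(user.id = 1:100000, age = 30)  # placeholder data

# Pick 30,000 distinct row indices at random and keep those rows
users_sample <- users[sample(nrow(users), 30000), ]
nrow(users_sample)
```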
We’ll begin by looking at the relationship between age and tenure. We wonder who tends to stick to the program longer, young people or older people:
Well, other than showing the grouping of people in three large clusters of tenure, the graph doesn’t really tell us much more. Let’s add a bit of jitter and some alpha to the points to see if we uncover anything about the distribution of points:
That’s just slightly better, but it doesn’t give us new information. All we see is things that we already knew from the univariate analysis (e.g. most of the population is young and middle-aged adults).
There’s just too much data on the plot, so let’s try using a summary instead. We’ll add both the median and mean to the graph so we can compare, but we’ll use the median as a more representative measure of the population as a whole because it’s a more robust statistic in the presence of outliers:
Now, this is much better. We can see a trend more clearly. Let’s add a smoother to make it even more obvious:
The story that this plot tells us is that the early adopters of Ecobici were seniors. The rest of the population came on board later on, thus having less tenure. Also, the bowl-shaped curve of the regression line tells us that there’s been a wave of young adults (around 30 years old) adopting the program more recently.
And now let’s facet it by gender and registration kind to see any possible variations in the distributions of those subpopulations:
Now, this is a very interesting plot, because it tells very different stories for people who registered online versus those who did it using Telmex or in the traditional way. For example, for the web subgroup, it appears as though more elderly women starting at age 63 have continued to join the program after the first year of operation, in contrast to elderly males who have stopped joining in more recent years.
However, it’s very important to keep in mind the smaller relative sizes of the web and Telmex populations when trying to interpret the plot. The plot itself gives us a warning of this as the variances are rather large in each case, thus decreasing our confidence that the story it seems to tell can generalize well.
Our next exploration will be getting at one of the initial questions: does age play a role in the adoption and active use of Ecobici?
During our analysis of the distribution of the age variable, we partially answered this question. However, it’s important to assess how age relates to the activity a user shows in the Ecobici program.
Let’s begin with a simple and straightforward scatterplot:
Let’s try to see how the population is distributed by adding an alpha parameter:
The bulk of the data seems to be below 300 rides, so in order to get a better view, let’s scale the vertical axis, change the plotting color to something light, and add two-dimensional contour lines to see where most points concentrate:
Now we can see what we discovered in the univariate analysis regarding the large number of users with just a few or zero rides. We can also see how the maximum number of points concentrates at around age 28 with a number of rides that varies between 10 and 100.
It would be nice if we could incorporate information about tenure in this same plot. One way to do it is to divide tenure into a discrete number of intervals and color code the points according to that.
Let’s begin by creating a new variable tenure_in_semesters:
And we’ll also facet by gender to include even more information in the plot:
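One possible way to build tenure_in_semesters is with cut. This is a sketch under assumptions: it treats a semester as 182.5 days and uses eight bins, which is enough to cover the maximum tenure of 1415 days; the original code may have used different breaks.

```r
tenure <- c(0, 100, 183, 400, 1415)  # toy tenures, in days

# Bin edges every half year: 0, 182.5, 365, ..., 1460
semester_breaks <- seq(0, 8 * 182.5, by = 182.5)

tenure_in_semesters <- cut(tenure,
                           breaks = semester_breaks,
                           include.lowest = TRUE,  # keep tenure 0 in the first bin
                           labels = 1:8)
tenure_in_semesters
```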
Now we have even more information in one plot: we can easily see that most of the users with zero rides also have large tenures, which is interesting but still hard to understand. Was there a massive signup of users in the beginning of the program as part of a large media event, maybe?
A couple of other things are revealed by this plot:
If we really want to understand how active users are, we need to plot not the raw count of rides, but instead something that reveals their regular activity, such as the average number of rides per week that they’ve taken. Let’s do that now:
With this last plot, we can confirm that users who joined in the last year or two tend to be more active on average than those who joined earlier.
We’ve discovered quite a lot of things so far using scatterplots. Let’s now turn our attention to creating plots that use summary statistics on the number of rides per age group rather than the raw counts:
Here there’s a clear trend: young people between the ages of 15 and 25 are the ones who take the most rides, and there’s a steady decline in the number of rides as people get older. There are, of course, outliers that drag the mean upwards, but we care only about the typical user for the most part.
As we’ve done before, we’ll check what the distributions look like for different segments of our population, but the same caveats mentioned in the last section apply here regarding the remarkable differences in the distributions for Telmex and web registration types:
One thing worth noticing, though, is the difference in the distributions between men and women in the normal subpopulation. Notice how the trends seem to reverse between the ages of 70 and 80, going down in the case of men and going up significantly in the case of women. Could this be due to the presence of outliers in a relatively small group? Let’s keep in mind that the number of people in that age range is just about 0.39% of the total.
Our analysis of the Ecobici users has helped us answer all of the questions we had in mind before beginning our exploration, but it has also revealed some unexpected findings. In the following sections, we’ll recapitulate the analysis done and present three plots that gave us important insights about the data.
While it was clear from the beginning that men used Ecobici more than women, the extent to which this was true wasn’t so clear. During the analysis, we found out that there was a downward trend in the proportion of women that completed an ever increasing number of rides, up to the point where men were the only ones to complete more than a thousand rides.
Also interesting in this plot is the fact that, if we divide the population by the type of signup process they used to enroll in the program, the distributions of each group do change substantially.
We take this result with a grain of salt, however, given that the sizes of the Telmex and web groups are orders of magnitude smaller than that of the “normal” group, so it’s likely that they will have much higher variance and may not faithfully reflect the true underlying distribution (assuming that there is indeed such an out-of-sample distribution which is substantially different from the “normal” distribution).
In this plot there’s also a hint of another phenomenon that we’ll see more clearly in the following section, namely that there’s a large proportion of people who sign up for the program but end up taking just a few rides (or none at all).
Let’s also take a look at the statistics on the number of rides by gender. Let’s run the by command passing the summary and sd (standard deviation) functions to get these results.
First the summary function:
## users$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 3.00 25.38 26.00 944.00
## --------------------------------------------------------
## users$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 8.00 40.81 48.00 1254.00
And then the sd function:
## users$gender: female
## [1] 51.97177
## --------------------------------------------------------
## users$gender: male
## [1] 73.78403
Clearly, the number of rides completed by the median person is extremely low, especially in the case of women, and their lower standard deviation also shows a less diverse population. We’ll see this result graphically in the following plot as well.
One might be quick to think that this is surely the result of the relatively large number of newcomers to the program, but sadly an even worse result comes up when we filter out people who joined in the last six months (using 31 December, 2013 as the reference date):
## older_users$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 1.00 24.83 22.00 944.00
## --------------------------------------------------------
## older_users$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 4.00 40.27 43.00 1254.00
This is a strong indication that, despite the large number of people who have joined, the program didn’t really take off, and people just didn’t use it quite as much as we might have expected.
It’s only the most enthusiastic of users that drag the mean upwards to about 25 rides in the case of women and to about 41 for men. This is especially true in the case of women where we can observe the third quartile being smaller than the mean.
Another thing we might be interested in seeing, rather than just the raw counts, is the average number of rides people have taken per week during the time they’ve been in the program. This is not hard to compute, as we have the tenure available:
## valid_users$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.09013 0.87690 0.80770 38.61000
## --------------------------------------------------------
## valid_users$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2121 1.4160 1.5620 66.1000
The result shows that, viewed as a whole, men appear to use Ecobici twice as much per week as women, but again the average number of rides per week doesn’t exceed two for either gender, which is sad news for the program, of course.
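For reference, the per-week figure can be sketched as the ride count divided by tenure expressed in weeks. The name valid_users appears in the output above; the filter used here (dropping users with zero tenure to avoid dividing by zero) and the column name rides.per.week are assumptions.

```r
users <- data.frame(rides.count = c(0, 14, 70),
                    tenure      = c(0, 49, 700))  # tenure in days

# Users who joined on the reference date have no elapsed weeks yet
valid_users <- users[users$tenure > 0, ]

# Rides divided by the number of weeks in the program
valid_users$rides.per.week <- valid_users$rides.count / (valid_users$tenure / 7)
valid_users
```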
The Ecobici dataset contains a field called status which tells us whether a given user is currently paying their annual fee, but that is obviously not enough to tell whether they actually use the program or not and to what extent.
So, to find this out, we analyzed the number of rides people have taken against their age, the number of years they’ve been enrolled in the program, as well as their gender, in hopes of finding patterns:
This plot reveals several interesting things:
Newcomers tend to have few rides in their history, as expected
The largest age groups in women seem to be confined to a more narrow range than those of men (notice how the two-dimensional density lines are more tightly packed and more to the left.) In other words, the female population tends to be younger than the male population in general, and their age range is narrower than that of men.
A majority of users joined the program in the last one or two years
Most teenagers and other young people have joined the program very recently
Users who joined the program recently appear to be more active than those who joined earlier. We confirmed this during our analysis by plotting not against the raw number of rides for each user, but rather against the average number of rides per week.
The horizontal stripes at the bottom of the plot show that there is a large number of people who signed up for the program, took a few rides and then stopped using it. There are even people who didn’t use the service at all. In this latter case, they appear to be mostly people who signed up at the beginning of the program back in 2010 (notice the dark blue stripe at the bottom).
From the graph, it’s not very clear whether there is a correlation between age and the number of rides a person takes, but we can run an analytical test to get this information:
##
## Pearson's product-moment correlation
##
## data: users$age and users$rides.count
## t = -22.548, df = 110890, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07341162 -0.06169413
## sample estimates:
## cor
## -0.0675552
The correlation is negative and very small (about -0.07), so it’s hard to make any conjectures, even though the large sample makes it statistically significant. This is understandable: there is a lot of variation within each age group (which is also evident from the shape of the plot). So what if we took the average (and median) number of rides per age and ran the test again? Let’s find out for the average first:
##
## Pearson's product-moment correlation
##
## data: age and rides.count.mean
## t = -3.4549, df = 69, p-value = 0.0009456
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5665722 -0.1655609
## sample estimates:
## cor
## -0.384031
And then for the median:
##
## Pearson's product-moment correlation
##
## data: age and rides.count.median
## t = -8.2672, df = 69, p-value = 6.488e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8060657 -0.5651258
## sample estimates:
## cor
## -0.7054224
This is a much stronger correlation. There seems to be a strong negative relationship between age and the typical number of rides, meaning that the older a user is, the fewer rides they tend to take on Ecobici (even if they had good intentions when they joined the program). Keep in mind that these tests use one summary value per age, so they describe the typical user of each age rather than individual users, for whom the correlation remains weak. We saw this during our exploratory analysis too, but the numbers here confirm the result analytically, and it aligns well with our intuition about the relationship between the two variables.
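The per-age aggregation behind the last two tests can be sketched as follows. The data here is simulated with an arbitrary, made-up relationship between age and rides, so the numbers it prints will not match the output above. Only cor.test() and aggregate() are standard; everything else is an assumption.

```r
set.seed(42)  # reproducible toy data, not the Ecobici dataset

n   <- 2000
age <- sample(16:80, n, replace = TRUE)
# Made-up relationship: expected rides shrink with age, with heavy skewed noise.
rides.count <- rpois(n, lambda = pmax(1, 60 - 0.6 * age)) * rexp(n)
users <- data.frame(age = age, rides.count = rides.count)

# Individual-level test: one observation per user.
individual_test <- cor.test(users$age, users$rides.count)

# Collapse to one row per age, once with the mean and once with the median.
by_age_mean   <- aggregate(rides.count ~ age, data = users, FUN = mean)
by_age_median <- aggregate(rides.count ~ age, data = users, FUN = median)

# Same test, now with one observation per age.
mean_test   <- cor.test(by_age_mean$age, by_age_mean$rides.count)
median_test <- cor.test(by_age_median$age, by_age_median$rides.count)

print(c(individual = unname(individual_test$estimate),
        by.mean    = unname(mean_test$estimate),
        by.median  = unname(median_test$estimate)))
```

Note that the degrees of freedom drop from the number of users minus two to the number of distinct ages minus two, which is why the aggregated tests above report df = 69.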
Finally, we’ll present a plot of age versus the median number of years enrolled in the Ecobici program, faceted by gender and type of registration:
This plot really tells us a lot of things! But we have to be careful in interpreting the results and keep in mind that the “web” and “telmex” populations are really small when compared to the “normal” population.
The story that the plot for the “normal” population tells us is that the early adopters of Ecobici were seniors, and the rest of the population came on board later on. Also, the bowl-shaped curve of the regression line tells us that there’s been a wave of young adults (around 30 years old) adopting the program more recently.
The previous remarks appear to be true in the case of the “web” population as well, but they don’t really hold in general for the “telmex” population. To try and confirm (or deny) our conjecture, we can run a correlation test between age and the median number of years enrolled in the program (“median tenure”) for the whole population:
##
## Pearson's product-moment correlation
##
## data: age and tenure.median
## t = 5.7704, df = 184, p-value = 3.3e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2623470 0.5067868
## sample estimates:
## cor
## 0.3914505
The result is positive but not very large, so even though age might play a role in determining tenure, it appears to be a small one in general.
Let’s see what happens if we segment by type of registration and run the same test for the “normal” population:
##
## Pearson's product-moment correlation
##
## data: age and tenure.median
## t = 2.4958, df = 69, p-value = 0.01496
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05836036 0.48827121
## sample estimates:
## cor
## 0.2877462
For the “telmex” population:
##
## Pearson's product-moment correlation
##
## data: age and tenure.median
## t = 2.6219, df = 54, p-value = 0.01134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08023422 0.55032723
## sample estimates:
## cor
## 0.3360463
And, finally, for the “web” population:
##
## Pearson's product-moment correlation
##
## data: age and tenure.median
## t = 10.153, df = 57, p-value = 2.154e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6876749 0.8780877
## sample estimates:
## cor
## 0.8024454
It’s clear that faceting the population by type of registration yields different insights about each subpopulation. For the “normal” and “telmex” populations, the correlation coefficients are close (about 0.29 and 0.34, with largely overlapping confidence intervals), so we can treat them as more or less the same. For the “web” population, however, the correlation is much stronger (about 0.80). This is an intriguing result, because our intuition is that older people are less prone to use the internet (at least in Mexico).
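Running the same test once per registration type can be done in a single pass. The sketch below assumes a table with one row per age and registration type. The column names mirror those in the output above, but the values are simulated and the code is an illustration, not the analysis that produced these results.

```r
set.seed(1)  # reproducible toy data

# Toy per-age table: one row per (age, registration type); tenure is made up.
by_age_type <- expand.grid(
  age               = 18:60,
  registration.type = c("normal", "telmex", "web"),
  stringsAsFactors  = FALSE
)
by_age_type$tenure.median <-
  1 + 0.02 * by_age_type$age + rnorm(nrow(by_age_type), sd = 0.5)

# Split the table by registration type and run the same test on each piece.
tests <- lapply(
  split(by_age_type, by_age_type$registration.type),
  function(d) cor.test(d$age, d$tenure.median)
)

# Collect the estimate, confidence interval and p-value side by side.
results <- t(sapply(tests, function(tt) c(
  cor     = unname(tt$estimate),
  ci.low  = tt$conf.int[1],
  ci.high = tt$conf.int[2],
  p.value = tt$p.value
)))
print(round(results, 4))
```

Putting the confidence intervals next to each other makes it easier to judge whether two coefficients really differ or whether their intervals overlap.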
To close this section, keep in mind that the large differences in the sizes of the subpopulations make it hard to be confident that the regression lines faithfully reflect the underlying distributions. Future work should focus on getting much more data on the smaller subpopulations to increase our confidence in the results.
We started our exploration with some initial questions in mind, but we also knew we might end up exploring other things as we dug deeper into the relationships hidden in the data.
The scatterplot matrix was a useful tool for getting a quick overview of what lay ahead, as it gave us some direction on what might be interesting to look at.
Our analysis then began with an attempt to understand each variable in isolation: histograms, augmented with lines showing the mean, median and interquartile range, were of great help here. Boxplots gave us additional information about what was considered an outlier in each distribution. We also made use of faceted views and coloring to visualize single variables for different subgroups (e.g. males vs. females, people who signed up offline vs. on the web).
Our analysis then proceeded to consider the relationship between pairs of continuous variables, and we realized that sometimes a direct plot of one variable against another doesn’t really provide much insight into the data. That’s when we turned to using summary statistics. We tried using both the mean and the median, but we settled on the median when trying to make conjectures because our dataset contained too much variation and too many outliers, making the mean a poor choice for understanding the typical member of a population.
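The effect of outliers on the mean is easy to see with a small example. The values below are made up, but they mimic the shape of the summary shown earlier for male users, where the mean (1.4160) sits far above the median (0.2121) because of a long right tail.

```r
# Made-up weekly ride rates: most users ride rarely, one rides a lot.
rides.per.week <- c(0, 0, 0.1, 0.2, 0.2, 0.3, 0.5, 1.0, 1.5, 66.1)

mean_rides   <- mean(rides.per.week)    # pulled up by the single heavy user
median_rides <- median(rides.per.week)  # stays close to the typical user

print(c(mean = mean_rides, median = median_rides))
```

Here nine of the ten users ride at most 1.5 times per week, yet the mean is close to 7, while the median stays at 0.25.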
One of the initial difficulties we ran into during the analysis was related to the fact that the data wasn’t completely tidy. We realized this as we were doing the visualizations and had to step back and clean it up a bit before proceeding. We tried to mitigate this to some degree with some simple cleaning steps, but we were aware that more work would be needed for a completely clean dataset.
Another difficulty we had was interpreting some of the data correctly. Specifically, in the case of the subpopulations defined by the kind of registration, we were initially inclined to believe that the differences we saw in their distributions and statistical summaries might be an indication of a truly different kind of population. But, as we progressed through the analysis, we came to realize that we should be very careful not to jump to conclusions because those subpopulations were very small in size and therefore very prone to high variance.
We were able to answer all of the initial questions we had in mind, and we were also lucky to make some important discoveries regarding the adoption of the Ecobici program that may be very valuable for improving it.
One of those discoveries is the unfortunate pattern that a considerable proportion of the people who sign up to the program end up using it just a couple of times, something that happened even more frequently in the case of women.
Another discovery we made was that, despite its growth stalling during its second year of operation, the program made a comeback in its third and fourth years that attracted a lot of young people who are very active in terms of rides per week.
One piece of future work that would be fantastic is to create a heatmap of Mexico City showing the levels of adoption by region (i.e. municipality). This requires, of course, cleaning up the data more thoroughly, and getting a shapefile of all the municipalities that comprise Mexico City. With that geospatial data, we could easily visualize other things too, such as user adoption faceted by gender, by “activity status” (as measured by the number of rides in a year), etc.
Other possibilities for future analysis include merging other datasets (or their summaries) with this one to find even more interesting facts about the users of Ecobici. For instance, the website where we obtained the dataset used in this analysis also makes available datasets regarding Ecobici stations and all rides taken since its launch in 2010. This could help us answer questions such as: what’s the average time that people of different ages exercise using Ecobici? What are some of the most/least commonly used stations at certain times during the day? Where should we place the next Ecobici station so that it benefits active and devoted users the most?